Proteomics is the quantitative study of the proteome, i. e., the whole set of proteins available in a biological sample1. Mass spectrometry (MS) is the current state-of-the-art method to do proteomics analyses2. MS consists of measuring the mass-to-charge ratio (\(m/z\)) of ions3. MS tends to be combined with high-performance liquid chromatography (HPLC)1. Thus, in liquid chromatography-mass spectrometry (LC-MS), ions tend to be identified based on both their chromatographic retention time and \(m/z\). In the case of MS-based proteomics, the ions to be analyzed are ionized peptides generated from enzymatic digestion1.
Proteomics mass spectrometry consists of two steps. First, intact peptide ions (also called precursors) generate a first MS1 spectrum. Then a subset of them is subjected into fragmentation, generating a MS/MS spectrum. The fragmentation tends to break peptide bonds, thus the MS/MS aids into the identification of the peptide sequence2.
There are two main data acquisition and quantification strategies: data-dependent acquisition (DDA) and data-independent acquisition (DIA),1 as shown in Figure 1.1. With the DDA strategy, a precursor is selected from MS1 at a given retention time to go into MS/MS. On the other hand, with the DIA strategy, multiple precursors are selected from a \(m/z\) window of a given retention time to go into MS/MS.
Figure 1.1: Data acquisition strategies: DDA (A) and DIA (B). (Sinha & Mann, 2020)
The usage of DIA seems promising, as it allows the identification and quantification of peptide in an unbiased way4. Besides, with the introduction of the new Thermo Fisher Scientific Inc.® instrument, the Orbitrap Astral mass spectrometer5, which is capable to quantify 5 times more peptides per unit time compared to other mass spectrometers compared to other DIA-based instruments6, the area of proteomics seems to be advancing rapidly.
Nevertheless, there a lot of challenges in the computational aspect of the proteomics field, such as the data size and complexity, the prevalence of closed source software and restrictive licenses, the lack of good documentation, the lack of standardization in proteomics software development and the lack of maintenance of open source software7. For example, the mass spectrometers raw output files are in proprietary, licensed, closed and binary formats8. Given the restrictive-access nature of these files, an open standard was needed, which is why the mzML9 file format was created.
Another well-known problem in current science is the reproducibility crisis10. Specifically, computational reproducibility is achieved when the same input data, code and software environment produce exactly the same output11. Nevertheless, this is not necessarily true across different computational platforms (i. e., different operating systems) due to intrinsic numerical instability12. To address this issue, the use of containers is a popular solution in modern scientific computing11. Containers are a self-contained executable package isolated from the host system13. Currently, the most used containerization technology is Docker14.
On the other hand, bioinformatics analyses are complex, multi-step processes that require the use of multiple specialized software tools in a chained sequence15. Besides the reproducibility issue previously discussed, there are more challenges that the field is facing such as: incompatibilities between software, manual software execution steps, high complexity and file size of omics data, among others16. To tackle these problems, Nextflow12 was developed both as a scientific workflow management system and domain specific language (DSL).
To assess this challenges, a Nextflow-based pipeline was to analyze both DDA and DIA proteomics data. The objective of this notebook is to benchmark the DIA-specific subworfklows (Figure 1.2). The tools integrated into the subworkflow created to convert files from raw to mzML format were MSConvert17 and ThermoRawFileParser18. For DIA, DIA-NN19 processes and workflows were implemented. Docker images were built for all the DIA-NN versions available for Linux at the moment of the pipeline development (1.8.1, 1.9.1, 1.9.2, 2.0, 2.0.1, 2.0.2 and 2.1). Given incompatibilities across different DIA-NN versions, specific subworkflows for DIA-NN v1.8.1 and DIA-NN v2.1 were coded. Note that DIA-NN v2.1 accepts raw files as input, making the mzML conversion an unnecessary step. The DIA-NN steps are integrated as different Nextflow processes. In the end, the pipeline was built as modular as possible.
Figure 1.2: Pipeline DIA-NN subworkflows metromap
DIA-NN is meant to be run in a multi-step pipeline following the next two steps:
Custom Docker images were created individually for most of the pipeline processes; except for the MSConvert conversion (proteowizard/pwiz-skyline-i-agree-to-the-vendor-licenses) and the ThermoRawFileParser conversion (quay.io/biocontainers/thermorawfileparser:1.4.5--h05cac1d_1).
To benchmark the pipeline, samples from Escherichia coli and Homo sapiens were processed and measured by in three different mass spectrometer instruments: FusionLumos, Exploris480 and Astral. Three replicates were used per each combination (i.e. different mass spectrometers and the three species), yielding 18 different raw files. Different pipeline profiles were run considering the DIA-NN versions and raw file files processing strategies in a combinatorial way.
To assess the computational efficiency of the pipeline runs, different metrics were considered and contrasted, such as the pipeline total runtime and the size of the input files. In order to address the pipeline performance regarding biological metrics, the precursor identification and quantification results were measured.
To assess the computational efficiency of the pipeline runs, different metrics were considered and contrasted, such as the pipeline total runtime and the size of the input files. Data, including file size, from server after running
ls -lh */*/data/*raw | tr -s ' ' '\t' > files_list.tsv and then secure copying
it to the local repo root. The units were converted to MB using custom functions.
## Reads TSV
raw_info <- read_tsv("./files_list.tsv",
col_names = FALSE)
## Keeps just the necessary information
raw_info %<>% dplyr::select(c("X5", "X9")) %>%
purrr::set_names(c("size", "fullname"))
## Mutates the tibble
raw_info %<>% mutate(size_MB = get_MB(size)) %>%
separate(
fullname,
into = c("instrument", "organism", "subfolder", "file_name"),
sep = "/"
)
raw_info %>%
select(-file_name) %>%
kbl(
caption = "Size of all raw files",
) %>%
kable_material(
c("striped", "hover"),
full_width = FALSE,
position = "center"
) %>%
scroll_box(
width = "80%",
height = "400px",
fixed_thead = TRUE
)
| size | instrument | organism | subfolder | size_MB |
|---|---|---|---|---|
| 7.2G | Astral | Ecoli | data | 7372.8 |
| 7.2G | Astral | Ecoli | data | 7372.8 |
| 7.2G | Astral | Ecoli | data | 7372.8 |
| 12G | Astral | Hsapiens | data | 12288.0 |
| 12G | Astral | Hsapiens | data | 12288.0 |
| 12G | Astral | Hsapiens | data | 12288.0 |
| 805M | Exploris480 | Ecoli | data | 805.0 |
| 796M | Exploris480 | Ecoli | data | 796.0 |
| 791M | Exploris480 | Ecoli | data | 791.0 |
| 1.5G | Exploris480 | Hsapiens | data | 1536.0 |
| 1.5G | Exploris480 | Hsapiens | data | 1536.0 |
| 1.5G | Exploris480 | Hsapiens | data | 1536.0 |
| 751M | FusionLumos | Ecoli | data | 751.0 |
| 754M | FusionLumos | Ecoli | data | 754.0 |
| 761M | FusionLumos | Ecoli | data | 761.0 |
| 788M | FusionLumos | Hsapiens | data | 788.0 |
| 797M | FusionLumos | Hsapiens | data | 797.0 |
| 787M | FusionLumos | Hsapiens | data | 787.0 |
As shown in Table 2.2, the Astral instrument produces larger raw files than the other two instruments: the median file size was 0.76 GB for the FusionLumos, 1.14 GB for the Exploris480 and 9.6 GB for the Astral.
### Median size by instrument
raw_info %>%
group_by(instrument) %>%
summarise(median(size_MB / 1024)) %>%
kbl(
caption = "Median size by instrument",
digits = 2
) %>%
kable_material(
c("striped", "hover"),
full_width = FALSE,
position = "center"
)
| instrument | median(size_MB/1024) |
|---|---|
| Astral | 9.60 |
| Exploris480 | 1.14 |
| FusionLumos | 0.76 |
Besides, statistics were computed per organism. The E. coli reference proteome (UP000000625) has 4,403 genes, while the H. sapiens (UP000005640) one has 20,598. As shown in Table 2.3, the median raw file size for E. coli was 0.777 GB, while for H. sapiens it was 1.5 GB, almost twice as large.
### Median size by organism
raw_info %>%
group_by(organism) %>%
summarise(median(size_MB / 1024)) %>%
kbl(
caption = "Median size by organism",
digits = 2
) %>%
kable_material(
c("striped", "hover"),
full_width = FALSE,
position = "center"
)
| organism | median(size_MB/1024) |
|---|---|
| Ecoli | 0.78 |
| Hsapiens | 1.50 |
The Figure 2.1 shows the file of the sizes per instrument and organism.
### Plot
ggplot(raw_info, aes(x = instrument, y = size_MB / 1024, colour = organism)) +
geom_beeswarm(size = 2, alpha = 0.7) +
scale_x_discrete(limits = c("FusionLumos", "Exploris480", "Astral")) +
scale_colour_manual(
name = "Organism",
values = org_cols
) +
theme_minimal() +
ylim(0, 12.5) +
labs(
x = "Mass Spectrometer",
y = "Raw file size (GB)"
) +
ggtitle("Size of raw files per experiment")
Figure 2.1: Size of raw files per experiment
To get the total runtimes, the Nextflow workflow reports were used.
Times were summarized after doing grep -m1 'duration: <strong>' */*/*/nf_report.html > runtimes.txt. As expected, the data from Astral takes longer to be analyzed (Table 2.4).
## File parsing
runtimes <- read_delim("runtimes.txt",
delim = " ", ## By some reason grep created 10 whitespaces
col_names = FALSE) %>%
set_names(c("run", "string")) %>%
mutate(time = get_seconds(string) / 60,
instrument = word(run, 1, sep = "/"),
organism = word(run, 2, sep = "/"),
profile = word(run, 3, sep = "/")) %>%
select(!run) %>%
select(!string)
### Median runtime per instrument
runtimes %>%
filter(!str_detect(profile, "1\\.9\\.1")) %>%
group_by(instrument) %>%
summarise(median(time)) %>%
kbl(
caption = "Median runtime per instrument",
digits = 2
) %>%
kable_material(
c("striped", "hover"),
full_width = FALSE,
position = "center"
)
| instrument | median(time) |
|---|---|
| Astral | 36.98 |
| Exploris480 | 10.18 |
| FusionLumos | 9.33 |
Similarly, samples from H. sapiens also take longer (Table 2.4).
### Median runtime per organism
runtimes %>%
filter(!str_detect(profile, "1\\.9\\.1")) %>%
group_by(organism) %>%
summarise(median(time)) %>%
kbl(
caption = "Median runtime per organism",
digits = 2
) %>%
kable_material(
c("striped", "hover"),
full_width = FALSE,
position = "center"
)
| organism | median(time) |
|---|---|
| Ecoli | 4.43 |
| Hsapiens | 22.28 |
The Figure 2.2 shows the total runtimes of the pipeline per instrument and organism.
### Plot
runtimes %>%
filter(!str_detect(profile, "1\\.9\\.1")) %>%
ggplot() +
geom_beeswarm(aes(y = time / 60, x = instrument, colour = organism),
size = 2,
alpha = 0.8,
) +
scale_colour_manual(
name = "Organism",
values = org_cols
) +
geom_boxplot(aes(y = time / 60, x = instrument),
width = .25, alpha = 0,
outlier.shape = NA) +
scale_x_discrete(limits = c("FusionLumos", "Exploris480", "Astral")) +
theme_minimal() +
labs(
x = "Mass Spectrometer",
y = "Total runtime (h)",
title = "Runtime per experiment"
)
Figure 2.2: Total runtimes of the pipeline profiles across experiments
As expected, there is a correlation between the size of the raw files and the total runtime of the pipeline, although this does not seem to be a linear correlation. Indeed, the pattern seems to be exponential, although more data would be needed to confirm that assumption.
raw_summary <- raw_info %>%
group_by(instrument, organism) %>%
summarise(
avg_raw_size = mean(size_MB, na.rm = TRUE),
## Cleans unnecessary metadata
.groups = "drop"
) %>%
## Converts back to GB
mutate(avg_raw_size = avg_raw_size / 1024)
size_time <- left_join(runtimes, raw_summary,
by = c("instrument", "organism"))
size_time %>%
filter(!str_detect(profile, "1\\.9\\.1")) %>%
ggplot(aes(x = avg_raw_size, y = time /60 ,
colour = organism,
shape = instrument)) +
geom_point(size = 3, alpha = 0.8) +
scale_colour_manual(
name = "Organism",
values = org_cols
) +
theme_minimal() +
labs(
x = "Average raw file size (GB)",
y = "Total runtime (h)",
title = "Runtime vs file size"
)
Figure 2.3: Pipeline runtimes vs size of raw files
if (!dir.exists("plots/runtimes")) dir.create("plots/runtimes")
(
size_time %>%
filter(!str_detect(profile, "1\\.9\\.1")) %>%
filter(instrument == "Astral") %>%
ggplot(aes(x = avg_raw_size, y = time /60 ,
colour = organism)) +
geom_point(size = 2, alpha = 0.8) +
scale_colour_manual(
name = "Organism",
values = org_cols
) +
theme_minimal() +
labs(
x = "Average raw file size (GB)",
y = "Total runtime (h)"
) +
ggtitle("Astral runs")
) %>%
ggsave("plots/runtimes/astral_runs.png",. )
(
size_time %>%
filter(!str_detect(profile, "1\\.9\\.1")) %>%
filter(instrument == "Exploris480") %>%
ggplot(aes(x = avg_raw_size, y = time /60 ,
colour = organism)) +
geom_point(size = 2, alpha = 0.8) +
scale_colour_manual(
name = "Organism",
values = org_cols
) +
theme_minimal() +
labs(
x = "Average raw file size (GB)",
y = "Total runtime (h)"
) +
ggtitle("Exploris480 runs")
) %>%
ggsave("plots/runtimes/exploris_runs.png",. )
(
size_time %>%
filter(!str_detect(profile, "1\\.9\\.1")) %>%
filter(instrument == "FusionLumos") %>%
ggplot(aes(x = avg_raw_size, y = time /60 ,
colour = organism)) +
geom_point(size = 2, alpha = 0.8) +
scale_colour_manual(
name = "Organism",
values = org_cols
) +
theme_minimal() +
labs(
x = "Average raw file size (GB)",
y = "Total runtime (h)"
) +
ggtitle("FusionLumos runs")
) %>%
ggsave("plots/runtimes/lumos_runs.png",. )
Given the complexity of the data generated by the Astral instrument with the H. sapiens samples, further tests were conducted with the pipeline runs statistics on this data. As shown in Figure 2.4, defining performance in terms of runtime, DIA-NN 2.1 outperforms all its predecessors, being the fastest version, whereas DIA-NN 1.8 seems to be an underperforming version. In other words, the pipeline using DIA-NN 2.1 is the fastest, while the one with DIA-NN 1.8 is the slowest.
if (!dir.exists("plots/runtimes_tool")) dir.create("plots/runtimes_tool")
deconv_rt <- function(x, ms, org) {
x %>%
filter(
!str_detect(profile, "1\\.9\\.1"),
instrument == ms,
organism == org
) %>%
mutate(
convert_tool = word(profile, 1, sep = "_"),
diann_version = word(profile, 2, sep = "_"),
) %>%
mutate(
diann_version = replace_na(diann_version, "diannv2.1")
) %>%
ggplot(aes(x = convert_tool, y = time / 60, shape = diann_version)) +
geom_beeswarm(size = 3, color = "#18974C") +
theme_minimal() +
labs(
x = "Raw conversion tool",
y = "Total runtime (h)"
) +
ggtitle(paste("Pipeline runtimes for", org, "measured in", ms))
}
ints <- c("Astral", "Exploris480", "FusionLumos")
orgs <- c("Hsapiens", "Ecoli")
for (ms in ints) {
for (org in orgs) {
p <- deconv_rt(size_time, ms, org) +
ylim(0, NA)
f <- file.path("plots/runtimes_tool", paste0(ms, "_", org, ".png"))
ggsave(filename = f, plot = p)
}
}
deconv_rt(size_time, "Astral", "Hsapiens") +
ylim(0, NA)
Figure 2.4: Pipeline total runtimes for H sapiens samples measured in Astral
## Standard QC thresholds on DIA-NN reports
## Extracts high quality peptides
qc <- function(x) {
x %>%
collect() %>%
filter(
## Standard thresholds
Q.Value <= 0.01,
Lib.Q.Value <= 0.01,
Lib.PG.Q.Value <= 0.01,
PG.Q.Value <= 0.05,
### Filters out contaminants
!str_detect(Protein.Group, "contam_sp"),
!str_detect(Protein.Ids, "contam_sp")
) %>%
pull(Precursor.Id) %>%
unique()
}
## Function to perform qc on the report's precursor
## and filter them out from the matrices
clean <- function(report, mat) {
out <- mat %>%
filter(Precursor.Id %in% qc(report)) %>%
## Collapses to precursor level and removes missing values
mutate(
## Converts NAs to 0s
across(c(ends_with("mzML"), ends_with("raw")),
~ replace_na(.x, 0))
) %>%
group_by(Precursor.Id) %>%
mutate(
## Sum intensities per precursors
across(c(ends_with("mzML"), ends_with("raw")),
sum),
) %>%
ungroup() %>%
filter(
## Filters our zeros
!if_any(c(ends_with("mzML"), ends_with("raw")),
~. == 0)
)
}
## Lists parquets
parquet_list <- dir_ls(
path = ".",
recurse = TRUE,
glob = "*report.parquet"
) %>%
## Reads all the parquet files
purrr::set_names() %>%
map(open_dataset, format = "parquet")
## Lists DIA-NN v1.8 reports
diann_list <- dir_ls(
path = ".",
recurse = TRUE,
glob = "*diann-report"
) %>%
## Read DIA-NN v1.8 reports
purrr::set_names() %>%
map(diann_load)
## Concatenates all
all_reports <- c(
parquet_list,
diann_list
)
## Lists matrices
pr_matrixes <- dir_ls(
path = ".",
recurse = TRUE,
regexp = "*pr_matrix.tsv"
) %>%
## Exclude firts passed
str_subset("first-pass", negate = TRUE) %>%
purrr::set_names() %>%
map(read_tsv)
# Performs QC
## Gets <instrument/organism/profile> keys
keys <- all_reports %>%
names() %>%
str_replace("^((?:[^/]+/){2}[^/]+).*", "\\1")
keyed_reports <- all_reports %>%
set_names(
sub("^((?:[^/]+/){2}[^/]+).*", "\\1", names(.) )
)
keyed_matrices <- pr_matrixes %>%
set_names(
sub("^((?:[^/]+/){2}[^/]+).*", "\\1", names(.) )
)
pr_matrixes_cleaned <- imap(
keyed_reports,
~ if(.y %in% names(keyed_matrices))
clean(.x, keyed_matrices[[.y]])
)
# Functions to extract information from DIA-NN outputs
## Get protein groups from parquet DIA-NN output
pg <- function(x) {
x %>%
collect() %>%
pull(Protein.Group) %>%
unique()
}
## Get proteins from parquet DIA-NN output
proteins <- function(x) {
x %>%
collect() %>%
pull(Protein.Ids) %>%
unique()
}
## Get peptides from parquet DIA-NN output
peptides <- function(x) {
x %>%
collect() %>%
pull(Precursor.Id) %>%
unique()
}
pg_list <- c(
pr_matrixes_cleaned %>% map(pg)
)
proteins_list <- c(
pr_matrixes_cleaned %>% map(proteins)
)
peptides_list <- c(
pr_matrixes_cleaned %>% map(peptides)
)
# Auxiliary functions to get Jaccard indexes
## Computes Jaccard Index
jaccard <- function(x, y) {
length(intersect(x, y)) / length(union(x, y))
}
## Commputes jaccard index pairwise of list
jpw <- function(x) {
n <- names(x)
## Expands pairwise
expand_grid(
names1 = n,
names2 = n,
) %>%
## Maps the function fer columns
mutate(
index = map2_dbl(x[names1], x[names2], jaccard)
) %>%
## Comes back to squared datafra,e
pivot_wider(
names_from = names2,
values_from = index
) %>%
column_to_rownames("names1")
}
## Function to automatize heatmap plotting
jaccard_hm <- function(x, string) {
x %>%
jpw() %>%
pheatmap(
main = string,
cluster_rows = TRUE,
cluster_cols = TRUE,
border_color = NA,
color = viridis(100),
angle_col = 45,
fontsize_row = 10,
fontsize_col = 10,
cellwidth = 30,
cellheight = 30
)
}
## Function to automatize upset plotting
jaccard_us <- function(x) {
x %>%
fromList() %>%
upset(nsets = ncol(.),
order.by = "freq",
decreasing = TRUE,
nintersects = 10)
}
pg_hsapiens <- pg_list %>%
{ .[str_detect(names(.), "Hsapiens")] }
pg_hsapiens_astral <- pg_hsapiens %>%
{ .[str_detect(names(.), "Astral")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(pg_hsapiens_astral, "Astral H sapiens | Pairwise Jaccard Index of identified protein groups")
jaccard_us(pg_hsapiens_astral)
pg_hsapiens_exploris <- pg_hsapiens %>%
{ .[str_detect(names(.), "Exploris480")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(pg_hsapiens_exploris, "Exploris480 H sapiens | Pairwise Jaccard Index")
jaccard_us(pg_hsapiens_exploris)
pg_hsapiens_lumos <- pg_hsapiens %>%
{ .[str_detect(names(.), "FusionLumos")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(pg_hsapiens_lumos, "FusionLumos H sapiens | Pairwise Jaccard Index")
jaccard_us(pg_hsapiens_lumos)
pg_ecoli <- pg_list %>%
{ .[str_detect(names(.), "Ecoli")] }
pg_ecoli_astral <- pg_ecoli %>%
{ .[str_detect(names(.), "Astral")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(pg_ecoli_astral, "Astral E coli | Pairwise Jaccard Index")
jaccard_us(pg_ecoli_astral)
pg_ecoli_exploris <- pg_ecoli %>%
{ .[str_detect(names(.), "Exploris480")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(pg_ecoli_exploris, "Exploris480 E coli | Pairwise Jaccard Index")
jaccard_us(pg_ecoli_exploris)
pg_ecoli_lumos <- pg_ecoli %>%
{ .[str_detect(names(.), "FusionLumos")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(pg_ecoli_lumos, "FusionLumos E coli | Pairwise Jaccard Index")
jaccard_us(pg_ecoli_lumos)
prots_hsapiens <- proteins_list %>%
{ .[str_detect(names(.), "Hsapiens")] }
prots_hsapiens_astral <- prots_hsapiens %>%
{ .[str_detect(names(.), "Astral")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(prots_hsapiens_astral, "Astral H sapiens | Pairwise Jaccard Index of identified proteins")
jaccard_us(prots_hsapiens_astral)
prots_hsapiens_exploris <- prots_hsapiens %>%
{ .[str_detect(names(.), "Exploris480")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(prots_hsapiens_exploris, "Exploris480 H sapiens | Pairwise Jaccard Index")
jaccard_us(prots_hsapiens_exploris)
prots_hsapiens_lumos <- prots_hsapiens %>%
{ .[str_detect(names(.), "FusionLumos")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(prots_hsapiens_lumos, "FusionLumos H sapiens | Pairwise Jaccard Index")
jaccard_us(prots_hsapiens_lumos)
prots_ecoli <- pg_list %>%
{ .[str_detect(names(.), "Ecoli")] }
prots_ecoli_astral <- prots_ecoli %>%
{ .[str_detect(names(.), "Astral")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(prots_ecoli_astral, "Astral E coli | Pairwise Jaccard Index")
jaccard_us(prots_ecoli_astral)
prots_ecoli_exploris <- prots_ecoli %>%
{ .[str_detect(names(.), "Exploris480")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(prots_ecoli_exploris, "Exploris480 E coli | Pairwise Jaccard Index")
jaccard_us(prots_ecoli_exploris)
prots_ecoli_lumos <- prots_ecoli %>%
{ .[str_detect(names(.), "FusionLumos")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(prots_ecoli_lumos, "FusionLumos E coli | Pairwise Jaccard Index")
jaccard_us(prots_ecoli_lumos)
pepts_hsapiens <- peptides_list %>%
{ .[str_detect(names(.), "Hsapiens")] }
pepts_hsapiens_astral <- pepts_hsapiens %>%
{ .[str_detect(names(.), "Astral")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(pepts_hsapiens_astral, "Astral H sapiens | Pairwise Jaccard Index of identified precursors")
jaccard_us(pepts_hsapiens_astral)
pepts_hsapiens_exploris <- pepts_hsapiens %>%
{ .[str_detect(names(.), "Exploris480")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(pepts_hsapiens_exploris, "Exploris480 H sapiens | Pairwise Jaccard Index")
jaccard_us(pepts_hsapiens_exploris)
pepts_hsapiens_lumos <- pepts_hsapiens %>%
{ .[str_detect(names(.), "FusionLumos")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(pepts_hsapiens_lumos, "FusionLumos H sapiens | Pairwise Jaccard Index")
jaccard_us(pepts_hsapiens_lumos)
pepts_ecoli <- peptides_list %>%
{ .[str_detect(names(.), "Ecoli")] }
pepts_ecoli_astral <- pepts_ecoli %>%
{ .[str_detect(names(.), "Astral")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(pepts_ecoli_astral, "Astral E coli | Pairwise Jaccard Index")
jaccard_us(pepts_ecoli_astral)
pepts_ecoli_exploris <- pepts_ecoli %>%
{ .[str_detect(names(.), "Exploris480")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(pepts_ecoli_exploris, "Exploris480 E coli | Pairwise Jaccard Index")
jaccard_us(pepts_ecoli_exploris)
pepts_ecoli_lumos <- pepts_ecoli %>%
{ .[str_detect(names(.), "FusionLumos")] } %>%
set_names( word(names(.), 3, sep = "/") )
jaccard_hm(pepts_ecoli_lumos, "FusionLumos E coli | Pairwise Jaccard Index")
jaccard_us(pepts_ecoli_lumos)
## Function to compute cross validations
compute_cv <- function(x) {
x %>%
rowwise() %>%
mutate(
cv = {
c_across(c(ends_with("mzML"), ends_with("raw"))) %>%
log() %>%
{ sd(.)/mean(.) }
}
) %>%
ungroup()
}
##
#pr_matrixes_cleaned %<>% lapply(compute_cv)
## Gets the common peptides detected across all profiles
common_peptds <- purrr::reduce(pepts_hsapiens_astral, intersect)
## Gets data just from Astral and Human
prs_hsapiens_astral <- pr_matrixes_cleaned %>%
{ .[str_detect(names(.), "Hsapiens")] } %>%
{ .[str_detect(names(.), "Astral")] } %>%
## Just common peptides
lapply(filter, Precursor.Id %in% common_peptds) %>%
## Compute cvs
lapply(compute_cv)
## Compute mean
#lapply(compute_means)
## Extracts cvs
cvs <- prs_hsapiens_astral %>%
## Creates tibble with sources and all cv's
imap_dfr(
~ tibble(
profile = .y,
cv = .x$cv
)
) %>%
mutate(
## Deletes unnecessary info
profile = word(profile, 3, sep = "/")
)
## Violin and boxplot
ggplot(cvs, aes(x = profile, y = cv)) +
geom_violin(color = "gray",
fill = "#6CC24A",
alpha = 0.5) +
geom_boxplot(fill = "#18974C",
width = .25,
outlier.shape = NA) +
theme_minimal() +
scale_y_sqrt() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
labs(
x = "Workflow",
y = "Coefficient of Variation",
title = "CV's across different workflows | Astral, H sapiens"
)
## Just boxplot
ggplot(cvs, aes(x = profile, y = cv)) +
geom_boxplot(fill = "#18974C",
width = .5,
outlier.shape = NA) +
ylim(0, 0.035) +
theme_minimal() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
labs(
x = "Workflow",
y = "Coefficient of Variation",
title = "CV's across different workflows | Astral, H sapiens"
)
## Warning: Removed 35422 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Stats plots
cvs %>%
group_by(profile) %>%
summarise(mean_cv = mean(cv)) %>%
ggplot(aes(x = reorder(profile, mean_cv), y = mean_cv)) +
geom_col(fill = "#18974C", color = "black") +
labs(title = "Mean CV per workflow",
x = "Profile",
y = "Mean CV") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
cvs %>%
group_by(profile) %>%
summarise(median_cv = median(cv)) %>%
ggplot(aes(x = reorder(profile, median_cv), y = median_cv)) +
geom_col(fill = "#18974C", color = "black") +
labs(title = "Median CV per workflow",
x = "Profile",
y = "Median CV") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
cvs %>%
group_by(profile) %>%
summarise(sd_cv = sd(cv)) %>%
ggplot(aes(x = reorder(profile, sd_cv), y = sd_cv)) +
geom_col(fill = "#18974C", color = "black") +
labs(title = "Standard Deviation CV per workflow",
x = "Profile",
y = "SD CV") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
### All with peptides identified with newer DIA-NN versions
list_pepts_no_diann1_8 <- pepts_hsapiens_astral[c(
"diannv2.1",
"msconvert_diannv1.9.2",
"msconvert_diannv2.0",
"msconvert_diannv2.0.1",
"msconvert_diannv2.0.2",
"thermoraw_diannv1.9.2",
"thermoraw_diannv2.0",
"thermoraw_diannv2.0.1",
"thermoraw_diannv2.0.2"
)]
## Intersection off all sets
pepts_no_diann1_8 <- Reduce(intersect, list_pepts_no_diann1_8)
## Intersection of set of dia-nn 1.8
pepts_diann1_8 <- c(
pepts_hsapiens_astral$msconvert_diannv1.8.1,
pepts_hsapiens_astral$thermoraw_diannv1.8.1
)
## Proteins identified by all versions but DIA-NN 1.8
pepts_unid <- setdiff(pepts_no_diann1_8, pepts_diann1_8)
## Flags if the precursor was identified by DIA-NN 1.8 or not in the precrusos matrix
all_prs_mat <- pr_matrixes_cleaned %>%
{ .[str_detect(names(.), "Hsapiens")] } %>%
{ .[str_detect(names(.), "Astral")] } %>%
map(
~.x %>%
mutate(is_unid = Precursor.Id %in% pepts_unid)
)
## Pivots everythin into a single object
all_long <- map_dfr(
all_prs_mat,
~ .x %>%
pivot_longer(
cols = c(ends_with("mzML"), ends_with("raw")),
names_to = "sample",
values_to = "intensity"
),
.id = "tibble_id"
)
## Plots all densities of all samples together
ggplot(all_long, aes(x = log(intensity),
color = !(is_unid),
group = interaction(sample, !(is_unid)))) +
geom_density(alpha = 0.7) +
theme_minimal() +
labs(
title = "Intensity densities of all Astral H. sapiens runs",
x = "log(Intensity)",
y = "Density",
color = "Identified by DIA-NN 1.8"
)
Driven by curiosity,
### All with proteins identified with newer DIA-NN versions
list_prots_no_diann1_8 <- prots_hsapiens_astral[c(
"diannv2.1",
"msconvert_diannv1.9.2",
"msconvert_diannv2.0",
"msconvert_diannv2.0.1",
"msconvert_diannv2.0.2",
"thermoraw_diannv1.9.2",
"thermoraw_diannv2.0",
"thermoraw_diannv2.0.1",
"thermoraw_diannv2.0.2"
)]
## Intersection off all sets
prots_no_diann1_8 <- Reduce(intersect, list_prots_no_diann1_8)
## Union of set of dia-nn 1.8
prots_diann1_8 <- c(
prots_hsapiens_astral$msconvert_diannv1.8.1,
prots_hsapiens_astral$thermoraw_diannv1.8.1
)
## Proteins identified by all versions but DIA-NN 1.8
prots_unid <- setdiff(prots_no_diann1_8, prots_diann1_8)
## All proteins identified with all versions
all_prots <- unlist(prots_hsapiens_astral) %>%
unique()
## Mart
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
## Convert ID's
all_prots_ids <- getBM(
attributes = c("uniprotswissprot", "entrezgene_id"),
filters = "uniprotswissprot",
values = all_prots,
mart = mart
) %>%
pull("entrezgene_id") %>%
as.character()
interest_ids <- getBM(
attributes = c("uniprotswissprot", "entrezgene_id"),
filters = "uniprotswissprot",
values = prots_unid,
mart = mart
) %>%
pull("entrezgene_id") %>%
as.character()
## GO Enrichment Analysis
### Biological Process
go_bp <- enrichGO(
gene = interest_ids,
universe = all_prots_ids,
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
readable = TRUE
)
### Molecular Function
go_mf <- enrichGO(
gene = interest_ids,
universe = all_prots_ids,
OrgDb = org.Hs.eg.db,
ont = "MF",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
readable = TRUE
)
### Cellular Component
go_cc <- enrichGO(
gene = interest_ids,
universe = all_prots_ids,
OrgDb = org.Hs.eg.db,
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
readable = TRUE
)
## All ontologies
go_all <- enrichGO(
gene = interest_ids,
universe = all_prots_ids,
OrgDb = org.Hs.eg.db,
ont = "ALL",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
readable = TRUE
)
## Plots
barplot(go_mf)
dotplot(go_mf)
sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS 15.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Mexico_City
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] kableExtra_1.4.0 org.Hs.eg.db_3.18.0 AnnotationDbi_1.64.1
## [4] IRanges_2.36.0 S4Vectors_0.40.2 Biobase_2.62.0
## [7] BiocGenerics_0.48.1 clusterProfiler_4.10.1 biomaRt_2.58.2
## [10] diann_1.0.1 paletteer_1.6.0 viridis_0.6.5
## [13] viridisLite_0.4.2 pheatmap_1.0.12 UpSetR_1.4.0
## [16] ggbeeswarm_0.7.2 fs_1.6.6 magrittr_2.0.3
## [19] arrow_19.0.1 lubridate_1.9.4 forcats_1.0.0
## [22] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
## [25] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [28] ggplot2_3.5.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.3.3 bitops_1.0-9 ggplotify_0.1.2
## [4] filelock_1.0.3 polyclip_1.10-7 XML_3.99-0.17
## [7] lifecycle_1.0.4 lattice_0.22-6 vroom_1.6.5
## [10] MASS_7.3-60.0.1 sass_0.4.9 rmarkdown_2.28
## [13] jquerylib_0.1.4 yaml_2.3.10 cowplot_1.1.3
## [16] DBI_1.2.3 RColorBrewer_1.1-3 zlibbioc_1.48.2
## [19] ggraph_2.2.1 RCurl_1.98-1.16 yulab.utils_0.1.7
## [22] tweenr_2.0.3 rappdirs_0.3.3 GenomeInfoDbData_1.2.11
## [25] enrichplot_1.22.0 ggrepel_0.9.6 tidytree_0.4.6
## [28] svglite_2.2.1 codetools_0.2-20 DOSE_3.28.2
## [31] xml2_1.3.6 ggforce_0.4.2 tidyselect_1.2.1
## [34] aplot_0.2.3 farver_2.1.2 BiocFileCache_2.10.2
## [37] jsonlite_1.8.9 tidygraph_1.3.1 systemfonts_1.2.3
## [40] tools_4.3.3 progress_1.2.3 treeio_1.26.0
## [43] ragg_1.3.3 Rcpp_1.0.13 glue_1.8.0
## [46] gridExtra_2.3 xfun_0.48 qvalue_2.34.0
## [49] GenomeInfoDb_1.38.8 withr_3.0.2 fastmap_1.2.0
## [52] fansi_1.0.6 digest_0.6.37 timechange_0.3.0
## [55] R6_2.5.1 gridGraphics_0.5-1 textshaping_0.4.0
## [58] colorspace_2.1-1 GO.db_3.18.0 RSQLite_2.3.9
## [61] utf8_1.2.4 generics_0.1.3 data.table_1.16.2
## [64] prettyunits_1.2.0 graphlayouts_1.2.0 httr_1.4.7
## [67] scatterpie_0.2.4 pkgconfig_2.0.3 gtable_0.3.5
## [70] blob_1.2.4 XVector_0.42.0 shadowtext_0.1.4
## [73] htmltools_0.5.8.1 bookdown_0.43 fgsea_1.28.0
## [76] scales_1.3.0 png_0.1-8 ggfun_0.1.6
## [79] knitr_1.48 rstudioapi_0.17.1 tzdb_0.4.0
## [82] reshape2_1.4.4 nlme_3.1-166 curl_5.2.3
## [85] cachem_1.1.0 parallel_4.3.3 vipor_0.4.7
## [88] HDO.db_0.99.1 pillar_1.9.0 grid_4.3.3
## [91] vctrs_0.6.5 dbplyr_2.5.0 beeswarm_0.4.0
## [94] evaluate_1.0.1 cli_3.6.3 compiler_4.3.3
## [97] rlang_1.1.4 crayon_1.5.3 labeling_0.4.3
## [100] rematch2_2.1.2 plyr_1.8.9 stringi_1.8.4
## [103] BiocParallel_1.36.0 assertthat_0.2.1 munsell_0.5.1
## [106] Biostrings_2.70.3 lazyeval_0.2.2 GOSemSim_2.28.1
## [109] Matrix_1.6-5 RcppEigen_0.3.4.0.2 hms_1.1.3
## [112] patchwork_1.3.0 bit64_4.5.2 KEGGREST_1.42.0
## [115] highr_0.11 igraph_2.1.1 memoise_2.0.1
## [118] bslib_0.8.0 ggtree_3.10.1 fastmatch_1.1-4
## [121] bit_4.5.0 ape_5.8 gson_0.1.0